Assessment of approximate string matching in a biomedical text retrieval problem
Abstract
Text-based search is widely used for biomedical data mining and knowledge discovery. Character errors in the literature reduce the accuracy of data mining, and methods for addressing this problem are being explored. This work tests the usefulness of the Smith-Waterman algorithm with an affine gap penalty as a method for biomedical literature retrieval. Names of medicinal herbs collected from the herbal medicine literature are matched against those from the medicinal chemistry literature using this algorithm at different string identity levels (80-100%). Performance is optimal at a string identity of 88%, at which recall and precision are 96.9% and 97.3%, respectively. Our study suggests that the Smith-Waterman algorithm is useful for improving the success rate of biomedical text retrieval.
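The matching procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the scoring values (`match`, `mismatch`, `gap_open`, `gap_extend`), the function names, and the definition of string identity (matched characters in the best local alignment divided by the length of the longer name) are all assumptions, since the abstract does not state them. Only the 88% threshold comes from the abstract.

```python
def smith_waterman_affine(a, b, match=2, mismatch=-1, gap_open=3, gap_extend=1):
    """Best local alignment of a and b with affine gaps (Gotoh recurrences).

    Returns (score, matches). Each DP cell holds a (score, matches) tuple;
    tuple comparison maximizes the score first and breaks ties by the
    number of matched characters. Scoring values are illustrative only.
    """
    neg = (float("-inf"), 0)
    zero = (0, 0)
    prev_h = [zero] * (len(b) + 1)   # H: best alignment ending at (i-1, j)
    prev_f = [neg] * (len(b) + 1)    # F: ends in a gap that consumes a[i-2]
    best = zero
    for i in range(1, len(a) + 1):
        cur_h = [zero] * (len(b) + 1)
        cur_f = [neg] * (len(b) + 1)
        e = neg                      # E: ends in a gap that consumes b[j-1]
        for j in range(1, len(b) + 1):
            # Either extend an existing gap or open a new one.
            e = max((e[0] - gap_extend, e[1]),
                    (cur_h[j - 1][0] - gap_open, cur_h[j - 1][1]))
            f = max((prev_f[j][0] - gap_extend, prev_f[j][1]),
                    (prev_h[j][0] - gap_open, prev_h[j][1]))
            same = a[i - 1] == b[j - 1]
            diag = (prev_h[j - 1][0] + (match if same else mismatch),
                    prev_h[j - 1][1] + (1 if same else 0))
            # Local alignment: a cell may always restart from zero.
            cur_h[j] = max(zero, diag, e, f)
            cur_f[j] = f
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best


def string_identity(a, b):
    """Assumed definition: aligned matches / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return smith_waterman_affine(a, b)[1] / longest


def is_match(a, b, threshold=0.88):
    """Accept a pair when identity reaches the threshold (88% per the abstract)."""
    return string_identity(a, b) >= threshold


def precision_recall(predicted, relevant):
    """Precision and recall of a set of predicted pairs against a reference set."""
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall
```

Under these assumed settings, a name with a single substituted character, such as "Panax ginsang" against "Panax ginseng", has an identity of 12/13 and would be accepted at the 88% threshold. Sweeping `threshold` from 0.80 to 1.00 and calling `precision_recall` at each level would mirror the kind of evaluation the abstract reports.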
Similar resources
Indexing Methods for Approximate String Matching
Indexing for approximate text searching is a novel problem receiving much attention because of its applications in signal processing, computational biology and text retrieval, to name a few. We classify most indexing methods in a taxonomy that helps understand their essential features. We show that the existing methods, rather than completely different as they are regarded, form a range of solut...
Approximate String Matching for Geographic Names and Personal Names
The problem of matching strings allowing errors has recently gained importance, considering the increasing volume of online textual data. In geotechnologies, approximate string matching algorithms find many applications, such as gazetteers, address matching, and geographic information retrieval. This paper presents a novel method for approximate string matching, developed for the recognition of...
O(k) Parallel Algorithms for Approximate String Matching
Given a text string T of length n, a shorter pattern string A of length m, and an integer k, a simple, straightforward O(k) parallel algorithm for finding all occurrences of the pattern string in the text string with at most k differences (as defined by edit distance) is presented. The algorithm uses the priority CRCW-PRAM model of computation and (n m+ k + 2) m = O(n m) processors. Over recent dec...
Improved Approximate Multiple Pattern String Matching using Consecutive Q Grams of Pattern
String matching is the task of finding all occurrences of a given pattern in a large text, both being sequences of characters drawn from a finite alphabet. This problem is fundamental in computer science and underlies many applications such as text retrieval, symbol manipulation, computational biology, data mining, and network security. The bit-parallelism method is used for increasing the proce...
A Hybrid Indexing Method for Approximate String Matching
We present a new indexing method for the approximate string matching problem. The method is based on a suffix array combined with a partitioning of the pattern. We analyze the resulting algorithm and show that the average retrieval time is , for some that depends on the error fraction tolerated and the alphabet size . It is shown that for approximately , where . The space required is four times...
Journal:
- Computers in biology and medicine
Volume 35, Issue 8
Pages: -
Publication date: 2005